Search Results: "huber"

12 March 2015

Erich Schubert: The sad state of sysadmin in the age of containers

System administration is in a sad state. It in a mess.

I'm not complaining about old-school sysadmins. They know how to keep systems running, manage update and upgrade paths.

This rant is about containers, prebuilt VMs, and the incredible mess they cause because their concept lacks notions of "trust" and "upgrades".

Consider for example Hadoop. Nobody seems to know how to build Hadoop from scratch. It's an incredible mess of dependencies, version requirements and build tools.

None of these "fancy" tools still builds by a traditional make command. Every tool has to come up with their own, incomptaible, and non-portable "method of the day" of building.

And since nobody is still able to compile things from scratch, everybody just downloads precompiled binaries from random websites. Often without any authentication or signature.

NSA and virus heaven. You don't need to exploit any security hole anymore. Just make an "app" or "VM" or "Docker" image, and have people load your malicious binary to their network.

The Hadoop Wiki Page of Debian is a typical example. Essentially, people have given up in 2010 to be able build Hadoop from source for Debian and offer nice packages.

To build Apache Bigtop, you apparently first have to install puppet3. Let it download magic data from the internet. Then it tries to run sudo puppet to enable the NSA backdoors (for example, it will download and install an outdated precompiled JDK, because it considers you too stupid to install Java.) And then hope the gradle build doesn't throw a 200 line useless backtrace.

I am not joking. It will try to execute commands such as e.g.

/bin/bash -c "wget http://www.scala-lang.org/files/archive/scala-2.10.3.deb ; dpkg -x ./scala-2.10.3.deb /"

Note that it doesn't even install the package properly, but extracts it to your root directory. The download does not check any signature, not even SSL certificates. (Source: Bigtop puppet manifests)

Even if your build would work, it will involve Maven downloading unsigned binary code from the internet, and use that for building.

Instead of writing clean, modular architecture, everything these days morphs into a huge mess of interlocked dependencies. Last I checked, the Hadoop classpath was already over 100 jars. I bet it is now 150, without even using any of the HBaseGiraphFlumeCrunchPigHiveMahoutSolrSparkElasticsearch (or any other of the Apache chaos) mess yet.

Stack is the new term for "I have no idea what I'm actually using".

Maven, ivy and sbt are the go-to tools for having your system download unsigned binary data from the internet and run it on your computer.

And with containers, this mess gets even worse.

Ever tried to security update a container?

Essentially, the Docker approach boils down to downloading an unsigned binary, running it, and hoping it doesn't contain any backdoor into your companies network.

Feels like downloading Windows shareware in the 90s to me.

When will the first docker image appear which contains the Ask toolbar? The first internet worm spreading via flawed docker images?

Back then, years ago, Linux distributions were trying to provide you with a safe operating system. With signed packages, built from a web of trust. Some even work on reproducible builds.

But then, everything got Windows-ized. "Apps" were the rage, which you download and run, without being concerned about security, or the ability to upgrade the application to the next version. Because "you only live once".

Update: it was pointed out that this started way before Docker: Docker is the new 'curl sudo bash' . That's right, but it's now pretty much mainstream to download and run untrusted software in your "datacenter". That is bad, really bad. Before, admins would try hard to prevent security holes, now they call themselves "devops" and happily introduce them to the network themselves!

22 January 2015

Erich Schubert: Year 2014 in Review as Seen by a Trend Detection System

We ran our trend detection tool Signi-Trend (published at KDD 2014) on news articles collected for the year 2014. We removed the category of financial news, which is overrepresented in the data set. Below are the (described) results, from the top 50 trends (I will push the raw result to appspot if possible due to file limits).

I have highlighted the top 10 trends in bold, but otherwise ordered them chronologically.

Updated: due to an error in a regexp, I had filtered out too many stories. The new results use more articles.

January

2014-01-29: Obama's state of the union address

February

2014-02-07: Sochi Olympics gay rights protests

2014-02-08: Sochi Olympics first results

2014-02-19: Violence in Ukraine and Maidan in Kiev

2014-02-20: Wall street reaction to Facebook buying WhatsApp

2014-02-22: Yanukovich leaves Kiev

2014-02-28: Crimea crisis begins

March

2014-03-01: Crimea crisis escalates futher

2014-03-02: NATO meeting on Crimea crisis

2014-03-04: Obama presents U.S. fiscal budget 2015 plan

2014-03-08: Malaysia Airlines MH-370 missing in South China Sea

2014-03-08: MH-370: many Chinese on board of missing airplane

2014-03-15: Crimean status referencum (upcoming)

2014-03-18: Crimea now considered part of Russia by Putin

2014-03-21: Russian stocks fall after U.S. sanctions.

April

2014-04-02: Chile quake and tsunami warning

2014-04-09: False positive? experience + views

2014-04-13: Pro-russian rebels in Ukraine's Sloviansk

2014-04-17: Russia-Ukraine crisis continues

2014-04-22: French deficit reduction plan pressure

2014-04-28: Soccer World Cup coverage: team lineups

May

2014-05-14: MERS reports in Florida, U.S.

2014-05-23: Russia feels sanctions impact

2014-05-25: EU elections

June

2014-06-06: World cup coverage

2014-06-13: Islamic state Camp Speicher massacre in Iraq

2014-06-14: Soccer world cup: Spain surprisingly destoyed by Netherlands

July

2014-07-05: Soccer world cup quarter finals

2014-07-17: Malaysian Airlines MH-17 shot down over Ukraine

2014-07-18: Russian blamed for 298 dead in airline downing

2014-07-19: Independent crash site investigation demanded

2014-07-20: Israel shelling Gaza causes 40+ casualties in a day

August

2014-08-07: Russia bans food imports from EU and U.S.

2014-08-08: Obama orders targeted air strikes in Iraq

2014-08-20: IS murders journalist James Foley, air strikes continue

2014-08-30: EU increases sanctions against Russia

September

2014-09-05: NATO summit with respect to IS and Ukraine conflict

2014-09-11: Scottish referendum upcoming - poll results are close

2014-09-23: U.N. on legality of U.S. air strikes in Syria against IS

2014-09-26: Star manager Bill Gross leaves Allianz/PIMCO for Janus

October

2014-10-22: Ottawa parliament shooting

2014-10-26: EU banking review

November

2014-11-05: U.S. Senate and governor elections

2014-11-12: Foreign exchange manipulation investigation results

2014-11-17: Japan recession

December

2014-12-11: CIA prisoner and U.S. torture centers revieled

2014-12-15: Sydney cafe hostage siege

2014-12-17: U.S. and Cuba relations improve unexpectedly

2014-12-18: Putin criticizes NATO, U.S., Kiev

2014-12-28: AirAsia flight QZ-8501 missing

As you can guess, we are really happy with this result - just like the result for 2013 it mentiones (almost) all the key events.

There probably is one "false positive" there: 2014-04-09 has a lot of articles talking about "experience" and "views", but not all refer to the same topic (we did not do topic modeling yet).

There are also some events missing that we would have liked to appear; many of these barely did not make it into the top 50, but do appear in the top 100, such as the Sony cyberattack (#51) and the Fergusson riots on November 11 (#66).

You can also explore the results online in a snapshot.

16 January 2015

Erich Schubert: Year 2014 in Review as Seen by a Trend Detection System

January

2014-01-29: Obama's State of the Union address

February

2014-02-05..23: Sochi Olympics (11x, including the four below)

2014-02-07: Gay rights protesters arrested at Sochi Olympics

2014-02-08: Sochi Olympics begins

2014-02-16: Injuries in Sochi Extreme Park

2014-02-17: Men's Snowboard cross finals called of because of fog

2014-02-19: Violence in Ukraine and Kiev

2014-02-22: Yanukovich leaves Kiev

2014-02-23: Sochi Olympics close

2014-02-28: Crimea crisis begins

March

2014-03-01..06: Crimea crisis escalates futher (3x)

2014-03-08: Malaysia Airlines machine missing in South China Sea (2x)

2014-03-18: Crimea now considered part of Russia by Putin

2014-03-28: U.N. condemns Crimea's secession

April

2014-04-17..18: Russia-Ukraine crisis continues (3x)

2014-04-20: South Korea ferry accident

May

2014-05-18: Cannes film festival

2014-05-25: EU elections

June

2014-06-13: Islamic state fighting in Iraq

2014-06-16: U.S. talks to Iran about Iraq

July

2014-07-17..19: Malaysian airline shot down over Ukraine (3x)

2014-07-20: Israel shelling Gaza kills 40+ in a day

August

2014-08-07: Russia bans EU food imports

2014-08-20: Obama orders U.S. air strikes in Iraq against IS

2014-08-30: EU increases sanctions against Russia

September

2014-09-04: NATO summit

2014-09-23: Obama orders more U.S. air strikes against IS

Oktober

2014-10-16: Ebola case in Dallas

2014-10-24: Ebola patient in New York is stable

November

2014-11-02: Elections: Romania, and U.S. rampup

2014-11-05: U.S. Senate elections

2014-11-25: Ferguson prosecution

Dezember

2014-12-08: IOC Olympics sport additions

2014-12-11: CIA prisoner center in Thailand

2014-12-15: Sydney cafe hostage siege

2014-12-17: U.S. and Cuba relations improve unexpectedly

2014-12-19: North Korea blamed for Sony cyber attack

2014-12-28: AirAsia flight 8501 missing

13 January 2015

Erich Schubert: Big data predictions for 2015

My big data predictions for 2015:

Big data will continue to fail to deliver for most companies.
This has several reasons, including in particular: 1: lack of data to analyze that actually benefits from big data tools and approaches (and which is not better analyzed with traditional tools). 2: lack of talent, and failure to attract analytics talent. 3: stuck in old IT, and too inflexible to allow using modern tools (if you want to use big data, you will need a flexible "in-house development" type of IT that can install tools, try them, abandon them, without going up and down the management chains) 4: too much marketing. As long as big data is being run by the marketing department, not by developers, it will fail.
Project consolidation: we have seen hundreds of big data software projects the last years. Plenty of them on Apache, too. But the current state is a mess, there is massive redundancy, and lots and lots of projects are more-or-less abandoned. Cloudera ML, for example, is dead: superseded by Oryx and Oryx 2. More projects will be abandoned, because we have way too many (including much too many NoSQL databases, that fail to outperform SQL solutions like PostgreSQL). As is, we have dozens of competing NoSQL databases, dozens of competing ML tools, dozens of everything.
Hype: the hype will continue, but eventually (when there is too much negative press on the term "big data" due to failed projects and inflated expectations) move on to other terms. The same is also happening to "data science", so I guess the next will be "big analytics", "big intelligence" or something like that.
Less openness: we have seen lots of open-source projects. However, many decided to go with Apache-style licensing - always ready to close down their sharing, and no longer share their development. In 2015, we'll see this happen more often, as companies try to make money off their reputation. At some point, copyleft licenses like GPL may return to popularity due to this.

22 December 2014

Erich Schubert: Java sum-of-array comparisons

This is a follow-up to the post by Daniel Lemire on a close topic.

Daniel Lemire hat experimented with boxing a primitive array in an interface, and has been trying to measure the cost.

I must admit I was a bit sceptical about his results, because I have seen Java successfully inlining code in various situations.

For an experimental library I occasionally work on, I had been spending quite a bit of time on benchmarking. Previously, I had used Google Caliper for it (I even wrote an evaluation tool for it to produce better statistics). However, Caliper hasn't seen much updates recently, and there is a very attractive similar tool at openJDK now, too: Java Microbenchmarking Harness (actually it can be used for benchmarking at other scale, too).

Now that I have experience in both, I must say I consider JMH superior, and I have switched over my microbenchmarks to it. One of the nice things is that it doesn't make this distinction of micro vs. macrobenchmarks, and the runtime of your benchmarks is easier to control.

I largely recreated his task using JMH. The benchmark task is easy: compute the sum of an array; the question is how much the cost is when allowing different data structures than double[].

My results, however, are quite different. And the statistics of JMH indicate the differences may be not significant, and thus indicating that Java manages to inline the code properly.

adapterFor       1000000  thrpt  50  836,898   13,223  ops/s
adapterForL      1000000  thrpt  50  842,464   11,008  ops/s
adapterForR      1000000  thrpt  50  810,343    9,961  ops/s
adapterWhile     1000000  thrpt  50  839,369   11,705  ops/s
adapterWhileL    1000000  thrpt  50  842,531    9,276  ops/s
boxedFor         1000000  thrpt  50  848,081    7,562  ops/s
boxedForL        1000000  thrpt  50  840,156   12,985  ops/s
boxedForR        1000000  thrpt  50  817,666    9,706  ops/s
boxedWhile       1000000  thrpt  50  845,379   12,761  ops/s
boxedWhileL      1000000  thrpt  50  851,212    7,645  ops/s
forSum           1000000  thrpt  50  845,140   12,500  ops/s
forSumL          1000000  thrpt  50  847,134    9,479  ops/s
forSumL2         1000000  thrpt  50  846,306   13,654  ops/s
forSumR          1000000  thrpt  50  831,139   13,519  ops/s
foreachSum       1000000  thrpt  50  843,023   13,397  ops/s
whileSum         1000000  thrpt  50  848,666   10,723  ops/s
whileSumL        1000000  thrpt  50  847,756   11,191  ops/s

The postfix is the iteration type: sum using for loops, with local variable for the length (L), or in reverse order (R); while loops (again with local variable for the length). The prefix is the data layout: the primitive array, the array using a static adapter (which is the approach I have been using in many implementations in cervidae) and using a "boxed" wrapper class around the array (roughly the approach that Daniel Lemire has been investigating. On the primitive array, I also included the foreach loop approach (for(double v:array) ).

If you look at the standard deviations, the results are pretty much identical, except for reverse loops. This is not surprising, given the strong inlining capabilities of Java - all of these codes will lead to next to the same CPU code after warmup and hotspot optimization.

I do not have a full explanation of the differences the others have been seeing. There is no "polymorphism" occurring here (at runtime) - there is only a single Array implementation in use; but this was the same with his benchmark.

Here is a visualization of the results (sorted by average):
Result boxplots

As you can see, most results are indiscernible. The measurement standard deviation is higher than the individual differences. If you run the same benchmark again, you will likely get a different ranking.

Note that performance may - drastically - drop once you use multiple adapters or boxing classes in the same hot codepath. Java Hotspot keeps statistics on the classes it sees, and as long as it only sees 1-2 different types, it performs quite aggressive optimizations instead of doing "virtual" method calls.

25 November 2014

Erich Schubert: Installing Debian with sysvinit

First let me note that I am using systemd, so these things here are untested by me. See e.g. Petter's and Simon's blog entries on the same overall topic.

According to the Debian installer maintainers, the only accepted way to install Debian with sysvinit is to use preseeding. This can either be done at the installer boot prompt by manually typing the magic spell:

preseed/late_command="in-target apt-get install -y sysvinit-core"

or by using a preseeding file (which is a really nice feature I used for installing my Hadoop nodes) to do the same:

d-i preseed/late_command string in-target apt-get install -y sysvinit-core

If you are a sysadmin, using preseeding can save you a lot of typing. Put all your desired configuration into preseeding files, put them on a webserver (best with a short name resolvable by local DNS). Let's assume you have set up the DNS name d-i.example.com, and your DHCP is configured such that example.com is on the DNS search list. You can also add a vendor extension to DHCP to serve a full URL. Manually enabling preseeding then means adding

auto url=d-i

to the installer boot command line (d-i is the hostname I suggested to set up in your DNS before, and the full URL would then be http://d-i.example.com/d-i/jessie/./preseed.cfg. Preseeding is well documented in Appendix B of the installer manual, but nevertheless will require a number of iterations to get everything work as desired for a fully automatic install like I used for my Hadoop nodes.

There might be an easier option.
I have filed a wishlist bug suggesting to use the tasksel mechanism to allow the user to choose sysvinit at installation time. However, it got turned down by the Debian installer maintainers quire rudely in a "No." - essentially this is a "shut the f... up and go away", which is in my opinion an inappropriate to discard a reasonable user wishlist request.

Since I don't intend to use sysvinit anymore, I will not be pursuing this option further. It is, as far as I can tell, still untested. If it works, it might be the least-effort, least-invasive option to allow the installation of sysvinit Jessie (except for above command line magic).

If you have interest in sysvinit, you (because I don't use sysvinit) should now test if this approach works.

Get the patch proposed to add a task-sysvinit package.
Build an installer CD with this tasksel (maybe this documentation is helpful for this step).
Test whether the patch works. Report results to above bug report, so that others interested in sysvinit can find them easily.
Find and fix bugs if it didn't work. Repeat.
Publish the modified ("forked") installer, and get user feedback.

If you are then still up for a fight, you can try to convince the maintainers (or go the nasty way, and ask the CTTE for their opinion, to start another flamewar and make more maintainers give up) that this option should be added to the mainline installer. And hurry up, or you may at best get this into Jessie reloaded, 8.1. - chance are that the release manager will not accept such patches this late anymore. The sysvinit supporters should have investigated this option much, much earlier instead of losing time on the GR.

Again, I won't be doing this job for you. I'm happy with systemd. But patches and proof-of-concept is what makes open source work, not GRs and MikeeUSA's crap videos spammed to the LKML...

(And yes, I am quite annoyed by the way the Debian installer maintainers handled the bug report. This is not how open-source collaboration is supposed to work. I tried to file a proper wishlist bug reporting, suggesting a solution that I could not find discussed anywhere before and got back just this "No. Shut up." answer. I'm not sure if I will be reporting a bug in debian-installer ever again, if this is the way they handle bug reports ...)

I do care about our users, though. If you look at popcon "vote" results, we have 4179 votes for sysvinit-core and 16918 votes for systemd-sysv (graph) indicating that of those already testing jessie and beyond - neglecting 65 upstart votes, and assuming that there is no bias to not-upgrade if you prefer sysvinit - about 20% appear to prefer sysvinit (in fact, they may even have manually switched back to sysvinit after being upgraded to systemd unintentionally?). These are users that we should listen to, and that we should consider adding an installer option for, too.

22 November 2014

Petter Reinholdtsen: How to stay with sysvinit in Debian Jessie

By now, it is well known that Debian Jessie will not be using sysvinit as its boot system by default. But how can one keep using sysvinit in Jessie? It is fairly easy, and here are a few recipes, courtesy of Erich Schubert and Simon McVittie. If you already are using Wheezy and want to upgrade to Jessie and keep sysvinit as your boot system, create a file /etc/apt/preferences.d/use-sysvinit with this content before you upgrade:

Package: systemd-sysv
Pin: release o=Debian
Pin-Priority: -1

This file content will tell apt and aptitude to not consider installing systemd-sysv as part of any installation and upgrade solution when resolving dependencies, and thus tell it to avoid systemd as a default boot system. The end result should be that the upgraded system keep using sysvinit. If you are installing Jessie for the first time, there is no way to get sysvinit installed by default (debootstrap used by debian-installer have no option for this), but one can tell the installer to switch to sysvinit before the first boot. Either by using a kernel argument to the installer, or by adding a line to the preseed file used. First, the kernel command line argument:

preseed/late_command="in-target apt-get install --purge -y sysvinit-core"

Next, the line to use in a preseed file:

d-i preseed/late_command string in-target apt-get install -y sysvinit-core

One can of course also do this after the first boot by installing the sysvinit-core package. I recommend only using sysvinit if you really need it, as the sysvinit boot sequence in Debian have several hardware specific bugs on Linux caused by the fact that it is unpredictable when hardware devices show up during boot. But on the other hand, the new default boot system still have a few rough edges I hope will be fixed before Jessie is released. Update 2014-11-26: Inspired by a blog post by Torsten Glaser, added --purge to the preseed line.

19 November 2014

Erich Schubert: What the GR outcome means for the users

The GR outcome is: no GR necessary

This is good news.

Because it says: Debian will remain Debian, as it was the last 20 years.

For 20 years, we have tried hard to build the "universal operating system", and give users a choice. We've often had alternative software in the archive. Debian has come up with various tool to manage alternatives over time, and for example allows you to switch the system-wide Java.

You can still run Debian with sysvinit. There are plenty of Debian Developers which will fight for this to be possible in the future.

The outcome of this resolution says:

Using a GR to force others is the wrong approach of getting compatibility.
We've offered choice before, and we trust our fellow developers to continue to work towards choice.
Write patches, not useless GRs. We're coders, not bureocrats.
We believe we can do this, without making it a formal MUST requirement. Or even a SHOULD requirement. Just do it.

The sysvinit proponents may perceive this decision as having "lost". But they just don't realize they won, too. Because the GR may easily have backfired on them. The GR was not "every package must support sysvinit". It was also "every sysvinit package must support systemd". Here is an example: eudev, a non-systemd fork of udev. It is not yet in Debian, but I'm fairly confident that someone will make a package of it after the release, for the next Debian. Given the text of the GR, this package might have been inappropriate for Debian, unless it also supports systemd. But systemd has it's own udev - there is no reason to force eudev to work with systemd, is there?

Debian is about choice. This includes the choice to support different init systems as appropriate. Not accepting a proper patch that adds support for a different init would be perceived as a major bug, I'm assured.

A GR doesn't ensure choice. It only is a hammer to annoy others. But it doesn't write the necessary code to actually ensure compatibility.

If GNOME at some point decides that systemd as pid 1 is a must, the GR only would have left us three options: A) fork the previous version, B) remove GNOME altogether, C) remove all other init systems (so that GNOME is compliant). Does this add choice? No.

Now, we can preserve choice: if GNOME decides to go systemd-pid1-only, we can both include a forked GNOME, and the new GNOME (depending on systemd, which is allowed without the GR). Or any other solution that someone codes and packages...

Don't fear that systemd will magically become a must. Trust that the Debian Developers will continue what they have been doing the last 20 years. Trust that there are enough Debian Developers that don't run systemd. Because they do exist, and they'll file bugs where appropriate. Bugs and patches, that are the appropriate tools, not GRs (or trolling).

18 November 2014

Erich Schubert: Generate iptables rules via pyroman

Vincent Bernat blogged on using Netfilter rulesets, pointing out that inserting the rules one-by-one using iptables calls may leave your firewall temporarily incomplete, eventually half-working, and that this approach can be slow.

He's right with that, but there are tools that do this properly. ;-)

Some years ago, for a multi-homed firewall, I wrote a tool called Pyroman. Using rules specified either in Python or XML syntax, it generates a firewall ruleset for you.

But it also adresses the points Vincent raised:

It uses iptables-restore to load the firewall more efficiently than by calling iptables a hundred times
It will backup the previous firewall, and roll-back on errors (or lack of confirmation, if you are remote and use --safe)

It also has a nice feature for the use in staging: it can generate firewall rule sets offline, to allow you reviewing them before use, or transfer them to a different host. Not all functionality is supported though (e.g. the Firewall.hostname constant usable in python conditionals will still be the name of the host you generate the rules on - you may want to add a --hostname parameter to pyroman)

pyroman --print-verbose will generate a script readable by iptables-restore except for one problem: it contains both the rules for IPv4 and for IPv6, separated by #### IPv6 rules. It will also annotate the origin of the rule, for example:

# /etc/pyroman/02_icmpv6.py:82
-A rfc4890f -p icmpv6 --icmpv6-type 255 -j DROP

indicates that this particular line was produced due to line 82 in file /etc/pyroman/02_icmpv6.py. This makes debugging easier. In particular it allows pyroman to produce a meaningful error message if the rules are rejected by the kernel: it will tell you which line caused the rule that was rejected.

For the next version, I will probably add --output-ipv4 and --output-ipv6 options to make this more convenient to use. So far, pyroman is meant to be used on the firewall itself.

Note: if you have configured a firewall that you are happy with, you can always use iptables-save to dump the current firewall. But it will not preserve comments, obviously.

9 November 2014

Erich Schubert: GR vote on init coupling

Overregulation is bad, and the project is suffering from the recent Anti-Systemd hate campaigning.

There is nothing balanced about the original GR proposal. It is bullshit from a policy point of view (it means we must remove software from Debian that would not work with other inits, such as gnome-journal, by policy).At the same time, it uses manipulative language like "freedom to select a different init system" (as if this would otherwise be impossible) and "accidentally locked in". It is exactly this type of language and behavior which has made Debian quite poisonous the last months.

In fact, the GR pretty much says "I don't trust my fellow maintainers to do the right thing, therefore I want a new hammer to force my opinion on them". This is unacceptable in my opinion, and the GR will only demotivate contributors. Every Debian developer (I'm not talking about systemd upstream, but about Debian developers!) I've met would accept a patch that adds support for sysvinit to a package that currently doesn't. The proposed GR will not improve sysvinit support. It is a hammer to kick out software where upstream doesn't want to support sysvinit, but it won't magically add sysvinit support anywhere.

What some supporters of the GR may not have realized - it may as well backfire on them. Some packages that don't yet work with systemd would violate policy then, too... - in my opinion, it is much better to make the support on a "as good as possible, given available upstream support and patches" basis, instead of a "must" basis. The lock-in may come even faster if we make init system support mandatory: it may be more viable to drop software to satisfy the GR than to add support for other inits - and since systemd is the current default, software that doesn't support systemd are good candidates to be dropped, aren't they? (Note that I do prefer to keep them, and have a policy that allows keeping them ...)

For these reasons I voted:

Choice 4: GR not required
Choice 3: Let maintainers do their work
Choice 2: Recommended, but not mandatory
Choice 5: No decision
Choice 1: Ban packages that don't work with every init system

Fact is that Debian maintainers have always been trying hard to allow people to choose their favorite software. Until you give me an example where the Debian maintainer (not upstream) has refused to include sysvinit support, I will continue to trust my fellow DDs. I've been considering to place Choice 2 below "further discussion", but essentially this is a no-op GR anyway - in my opinion "should support other init systems" is present in default Debian policy already anyway...

Say no to the haters.

And no, I'm not being unfair. One of the most verbose haters going by various pseudonyms such as Gregory Smith (on Linux Kernel Mailing list), Brad Townshend (LKML) and John Garret (LKML) has come forward with his original alias - it is indeed MikeeUSA, a notorious anti-feminist troll (see his various youtube "songs", some of them include this pseudonym). It's easy to verify yourself.

He has not contributed anything to the open source community. His songs and "games" are not worth looking at, and I'm not aware of any project that has accepted any of his "contributions". Yet, he uses several sock puppets to spread his hate.

The anti-systemd "crowd" (if it acually is more than a few notorious trolls) has lost all its credibility in my opinion. They spread false information, use false names, and focus on hate instead of improving source code. And worse, they tolerate such trolling in their ranks.

23 October 2014

Erich Schubert: Clustering 23 mio Tweet locations

To test scalability of ELKI, I've clustered 23 million Tweet locations from the Twitter Statuses Sample API obtained over 8.5 months (due to licensing restrictions by Twitter, I cannot make this data available to you, sorry.

23 million points is a challenge for advanced algorithms. It's quite feasible by k-means; in particular if you choose a small k and limit the number of iterations. But k-means does not make a whole lot of sense on this data set - it is a forced quantization algorithm, but does not discover actual hotspots.

Density-based clustering such as DBSCAN and OPTICS are much more appropriate. DBSCAN is a bit tricky to parameterize - you need to find the right combination of radius and density for the whole world. Given that Twitter adoption and usage is quite different it is very likely that you won't find a single parameter that is appropriate everywhere.

OPTICS is much nicer here. We only need to specify a minimum object count - I chose 1000, as this is a fairly large data set. For performance reasons (and this is where ELKI really shines) I chose a bulk-loaded R*-tree index for acceleration. To benefit from the index, the epsilon radius of OPTICS was set to 5000m. Also, ELKI allows using geodetic distance, so I can specify this value in meters and do not get much artifacts from coordinate projection.

To extract clusters from OPTICS, I used the Xi method, with xi set to 0.01 - a rather low value, also due to the fact of having a large data set.

The results are pretty neat - here is a screenshot (using KDE Marble and OpenStreetMap data, since Google Earth segfaults for me right now):

Screenshot of Clusters in central Europe

Some observations: unsurprisingly, many cities turn up as clusters. Also regional differences are apparent as seen in the screenshot: plenty of Twitter clusters in England, and low acceptance rate in Germany (Germans do seem to have objections about using Twitter; maybe they still prefer texting, which was quite big in Germany - France and Spain uses Twitter a lot more than Germany).

Spam - some of the high usage in Turkey and Indonesia may be due to spammers using a lot of bots there. There also is a spam cluster in the ocean south of Lagos - some spammer uses random coordinates [0;1]; there are 36000 tweets there, so this is a valid cluster...

A benefit of OPTICS and DBSCAN is that they do not cluster every object - low density areas are considered as noise. Also, they support clusters of different shape (which may be lost in this visualiation, which uses convex hulls!) and different size. OPTICS can also produce a hierarchical result.

Note that for these experiments, the actual Tweet text was not used. This has a rough correspondence to Twitter popularity "heatmaps", except that the clustering algorithms will actually provide a formalized data representation of activity hotspots, not only a visualization.

You can also explore the clustering result in your browser - the Google Drive visualization functionality seems to work much better than Google Earth.

If you go to Istanbul or Los Angeles, you will see some artifacts - odd shaped clusters with a clearly visible spike. This is caused by the Xi extraction of clusters, which is far from perfect. At the end of a valley in the OPTICS plot, it is hard to decide whether a point should be included or not. These errors are usually the last element in such a valley, and should be removed via postprocessing. But our OpticsXi implementation is meant to be as close as possible to the published method, so we do not intend to "fix" this.

Certain areas - such as Washington, DC, New York City, and the silicon valley - do not show up as clusters. The reason is probably again the Xi extraction - these region do not exhibit the steep density increase expected by Xi, but are too blurred in their surroundings to be a cluster.

Hierarchical results can be found e.g. in Brasilia and Los Angeles.

Compare the OPTICS results above to k-means results (below) - see why I consider k-means results to be a meaningless quantization?

Sure, k-means is fast (30 iterations; not converged yet. Took 138 minutes on a single core, with k=1000. The parallel k-means implementation in ELKI took 38 minutes on a single node, Hadoop/Mahout on 8 nodes took 131 minutes, as slow as a single CPU core!). But you can see how sensitive it is to misplaced coordinates (outliers, but mostly spam), how many "clusters" are somewhere in the ocean, and that there is no resolution on the cities? The UK is covered by 4 clusters, with little meaning; and three of these clusters stretch all the way into Bretagne - k-means clusters clearly aren't of high quality here.

If you want to reproduce these results, you need to get the upcoming ELKI version (0.6.5~201410xx - the output of cluster convex hulls was just recently added to the default codebase), and of course data. The settings I used are:

-dbc.in coords.tsv.gz
-db.index tree.spatial.rstarvariants.rstar.RStarTreeFactory
-pagefile.pagesize 500
-spatial.bulkstrategy SortTileRecursiveBulkSplit
-time
-algorithm clustering.optics.OPTICSXi
-opticsxi.xi 0.01
-algorithm.distancefunction geo.LngLatDistanceFunction
-optics.epsilon 5000.0 -optics.minpts 1000
-resulthandler KMLOutputHandler -out /tmp/out.kmz

and the total runtime for 23 million points on a single core was about 29 hours. The indexes helped a lot: less than 10000 distances were computed per point, instead of 23 million - the expected speedup over a non-indexed approach is 2400.

Don't try this with R or Matlab. Your average R clustering algorithm will try to build a full distance matrix, and you probably don't have an exabyte of memory to store this matrix. Maybe start with a smaller data set first, then see how long you can afford to increase the data size.

21 October 2014

Erich Schubert: Avoiding systemd isn't hard

Don't listen to trolls. They lie.

Debian was and continues to be about choice. Previously, you could configure Debian to use other init systems, and you can continue to do so in the future.

In fact, with wheezy, sysvinit was essential. In the words of trolls, Debian "forced" you to install SysV init!

With jessie, it will become easier to choose the init system, because neither init system is essential now. Instead, there is an essential meta-package "init", which requires you to install one of systemd-sysv sysvinit-core upstart. In other words, you have more choice than ever before.

Again: don't listen to trolls.

However, notice that there are some programs such as login managers (e.g. gdm3) which have an upstream dependency on systemd. gdm3 links against libsystemd0 and depends on libpam-systemd; and the latter depends on systemd-sysv systemd-shim so it is in fact a software such as GNOME that is pulling systemd onto your computer.

IMHO you should give systemd a try. There are some broken (SysV-) init scripts that cause problems with systemd; but many of these cases have now been fixed - not in systemd, but in the broken init script.

However, here is a clean way to prevent systemd from being installed when you upgrade to jessie. (No need to "fork" Debian for this, which just demonstrates how uninformed some trolls are ... - apart from Debian being very open to custom debian distributions, which can easily be made without "forking".)

As you should know, apt allows version pinning. This is the proper way to prevent a package from being installed. All you need to do is create a file named e.g. /etc/apt/preferences.d/no-systemd with the contents:

Package: systemd-sysv
Pin: release o=Debian
Pin-Priority: -1

from the documentation, a priority less than 0 disallows the package from being installed. systemd-sysv is the package that would enable systemd as your default init (/sbin/init).

This change will make it much harder for aptitude to solve dependencies. A good way to help it to solve the dependencies is to install the systemd-shim package explicitly first:

aptitude install systemd-shim

After this, I could upgrade a Debian system from wheezy to jessie without being "forced" to use systemd...

In fact, I could also do an aptitude remove systemd systemd-shim. But that would have required the uninstallation of GNOME, gdm3 and network-manager - you may or may not be willing to do this. On a server, there shouldn't be any component actually depending on systemd at all. systemd is mostly a GNOME-desktop thing as of now.

As you can see, the trolls are totally blaming the wrong people, for the wrong reasons... and in fact, the trolls make up false claims (as a fact, systemd-shim was updated on Oct 14). Stop listening to trolls, please.

If you find a bug - a package that needlessly depends on systemd, or a good way to remove some dependency e.g. via dynamic linking, please contribute a patch upstream and file a bug. Solve problems at the package/bug level, instead of wasting time doing hate speeches.

18 October 2014

Erich Schubert: Beware of trolls - do not feed

A particularly annoying troll has been on his hate crusade against systemd for months now.

Unfortunately, he's particularly active on Debian mailing lists (but apparently also on Ubuntu and the Linux Kernel mailing list) and uses a tons of fake users he keeps on setting up. Our listmasters have a hard time blocking all his hate, sorry.

Obviously, this is also the same troll that has been attacking Lennart Poettering.

There is evidence that this troll used to go by the name "MikeeUSA", and has quite a reputation with anti-feminist hate for over 10 years now.

Please, do not feed this troll.

Here are some names he uses on YouTube: Gregory Smith, Matthew Bradshaw, Steve Stone.

Blacklisting is the best measure we have, unfortunately.

Even if you don't like the road systemd is taking or Lennart Poetting personall - the behaviour of that troll is unacceptable to say the least; and indicates some major psychological problems... also, I wouldn't be surprised if he is also involved in #GamerGate.

See this example (LKML) if you have any doubts. We seriously must not tolerate such poisonous people.

If you don't like systemd, the acceptable way of fighting it is to write good alternative software (and you should be able to continue using SysV init or openRC, unless there is a bug, in Debian - in this case, provide a bug fix). End of story.

17 October 2014

Erich Schubert: Google Earth on Linux

Google Earth for Linux appears to be largely abandoned by Google, unfortunately. The packages available for download cannot be installed on a modern amd64 Debian or Ubuntu system due to dependency issues.

In fact, the adm64 version is a 32 bit build, too. The packages are really low quality, the dependencies are outdated, locales support is busted etc.

So here are hacky instructions how to install nevertheless. But beware, these instructions are a really bad hack.

These instructions are appropriate for version 7.1.2.2041-r0. Do not use them for any other version. Things will have changed.
Make sure your system has i386 architecture enabled. Follow the instructions in section "Configuring architectures" on the Debian MultiArch Wiki page to do so
Install lsb-core, and try to install the i386 versions of these packages, too!
Download the i386 version of the Google Earth package

Install the package by forcing dependencies, via

sudo dpkg --force-depends -i google-earth-stable_current_i386.deb

As of now, your package manager will complain, and suggest to remove the package again. To make it happy, we have to hack the installed packages list. This is ugly, and you should make a backup. You can totally bust your system this way... Fortunately, the change we're doing is rather simple. As admin, edit the file /var/lib/dpkg/status. Locate the section Package: google-earth-stable. In this section, delete the line starting with Depends:. Don't add in extra newlines or change anything else!
Now the package manager should believe the dependencies of Google Earth are fulfilled, and no longer suggest removal. But essentially this means you have to take care of them yourself!

Some notes on using Google Earth:

Locales are busted. Use LC_NUMERIC=en_US.UTF-8 google-earth to start it. Otherwise, it will fail parsing coordinates, if you are in a locale that uses a different number format.
You may need to install the i386 versions of some libraries, in particular of your OpenGL drivers! I cannot provide you with a complete list.
Search doesn't work sometimes for me.
Occassionally, it reports "unknown" network errors.
If you upgrade Nvidia graphics drivers, you will usually have to reboot, or you will see graphics errors.
Some people have removed/replaced the bundled libQt* and libfreeimage* libraries, but that did not work for me.

25 August 2014

Erich Schubert: Analyzing Twitter - beware of spam

This year I started to widen up my research; and one data source of interest was text because of the lack of structure in it, that makes it often challenging. One of the data sources that everybody seems to use is Twitter: it has a nice API, and few restrictions on using it (except on resharing data). By default, you can get a 1% random sample from all tweets, which is more than enough for many use cases.

We've had some exciting results which a colleague of mine will be presenting tomorrow (Tuesday, Research 22: Topic Modeling) at the KDD 2014 conference:

SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds
Erich Schubert, Michael Weiler, Hans-Peter Kriegel
20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

You can also explore some (static!) results online at signi-trend.appspot.com

In our experiments, the "news" data set was more interesting. But after some work, we were able to get reasonable results out of Twitter as well. As you can see from the online demo, most of these fall into pop culture: celebrity deaths, sports, hip-hop. Not much that would change our live; and even less that wasn't before captured by traditional media.

The focus of this post is on the preprocessing needed for getting good results from Twitter. Because it is much easier to get bad results!

The first thing you need to realize about Twitter is that due to the media attention/hype it gets, it is full of spam. I'm pretty sure the engineers at Twitter already try to reduce spam; block hosts and fraud apps. But a lot of the trending topics we discovered were nothing but spam.

Retweets - the "like" of Twitter - are an easy source to see what is popular, but are not very interesting if you want to analyze text. They just reiterate the exact same text (except for a "RT " prefix) than earlier tweets. We found results to be more interesting if we removed retweets. Our theory is that retweeting requires much less effort than writing a real tweet; and things that are trending "with effort" are more interesting than those that were just liked.

Teenie spam. If you ever searched for a teenie idol on Twitter, say this guy I hadn't heard of before, but who has 3.88 million followers on Twitter, and search for Tweets addressed to him, you will get millions over millions of results. Many of these tweets look like this:

@5SOS @Luke5SOS @Ashton5SOS @Michael5SOS @Calum5SOS CAN YOU FOLLOW M ? It would mean the world to me! Make my dream come true! x804 Chantal (@Chantal_San22) 25. August 2014

Now if you look at this tweet, there is this odd "x804" at the end. This is to defeat a simple spam filter by Twitter. Because this user did not tweet this just once: instead it is common amongst teenie to spam their idols with follow requests by the dozen. Probably using some JavaScript hack, or third party Twitter client. Occassionally, you see hundreds of such tweets, each sent within a few seconds of the previous one. If you get a 1% sample of these, you still get a few then...

Even worse (for data analysis) than teenie spammers are commercial spammers and wannabe "hackers" that exercise their "sk1llz" by spamming Twitter. To get a sample of such spam, just search for weight loss on Twitter. There is plenty of fresh spam there, usually consisting of some text pretending to be news, and an anonymized link (there is no need to use an URL shortener such as bit.ly on Twitter, since Twitter has its own URL shortener t.co; and you'll end up with double-shortened URLs). And the hacker spam is even worse (e.g. #alvianbencifa) as he seems to have trojaned hundreds of users, and his advertisement seems to be a nonexistant hash tag, which he tries to get into Twitters "trending topics".

And then there are the bots. Plenty of bots spam Twitter with their analysis of trending topics, reinforcing the trending topics. In my opinion, bots such as trending topics indonesia are useless. No wonder there are only 280 followers. And of the trending topics reported, most of them seem to be spam topics...

Bottom line: if you plan on analyzing Twitter data, spend considerable time on preprocessing to filter out spam of various kind. For example, we remove singletons and digits, then feed the data through a duplicate detector. We end up discarding 20%-25% of Tweets. But we still get some of the spam, such as that hackers spam.

All in all, real data is just messy. People posting there have an agenda that might be opposite to yours. And if someone (or some company) promises you wonders from "Big data" and "Twitter", you better have them demonstrate their use first, before buying their services. Don't trust visions of what could be possible, because the first rule of data analysis is: Garbage in, garbage out.

18 June 2014

Ritesh Raj Sarraf: Laptop Mode Tools 1.65

I am very pleased to announce the release of Laptop Mode Tools, at version 1.65 This release took a toll given things have been changing for me, both personally and professionally. 1.64 was released on September 1st, 2013. So it was a full 9 month period, of which a good 2-3 months were procrastination. That said, this release has some pretty good bug fixes and I urge all distribution packagers to push it to their repositories soon. While I'd thank all contributors who have helped make this release, a special thank you to Stefan Huber. Stefan found/fixed many issues, did the messy code clean up etc.. Thank you. Worthy changes are mentioned below. For full details, please refer to the git commit logs. 1.65 - Wed Jun 18 19:22:35 IST 2014
* fix grep error on missing $device/uevent
* ethernet: replace sysfs/enabled by 'ip link down'
* wireless-iwl-power: sysfs attr enbable -> enabled
* wireless-iwl-power: Add iwlwifi support
* Use Runtime Power Managemet Framework is more robust now. Deprecates module
usb-autosuspend
* Fix multiple hibernate issue
* When resuming, run LMT in force initialization mode
* Add module for Intel PState driver
* GUI: Implement suspend/hibernate interface

Categories:
Debian-Blog
Technology
Tools

Keywords:
laptop-mode-tools

RHUT

22 April 2014

Erich Schubert: Kernel-density based outlier detection and the need for customization

Outlier detection (also: anomaly detection, change detection) is an unsupervised data mining task that tries to identify the unexpected.

Most outlier detection methods are based on some notion of density: in an appropriate data representation, "normal" data is expected to cluster, and outliers are expected to be further away from the normal data.

This intuition can be quantified in different ways. Common heuristics include kNN outlier detection and the Local Outlier Factor (which uses a density quotient). One of the directions in my dissertation was to understand (also from a statistical point of view) how the output and the formal structure of these methods can be best understood.

I will present two smaller results of this analysis at the SIAM Data Mining 2014 conference: instead of the very heuristic density estimation found in above methods, we design a method (using the same generalized pattern) that uses a best-practise from statistics: Kernel Density Estimation. We aren't the first to attempt this (c.f. LDF), but we actuall retain the properties of the kernel, whereas the authors of LDF tried to mimic the LOF method too closely, and this way damaged the kernel.

The other result presented in this work is the need to customize. When working with real data, using "library algorithms" will more often than not fail. The reason is that real data isn't as nicely behaved - it's dirty, it seldom is normal distributed. And the problem that we're trying to solve is often much narrower. For best results, we need to integrate our preexisting knowledge of the data into the algorithm. Sometimes we can do so by preprocessing and feature transformation. But sometimes, we can also customize the algorithm easily.

Outlier detection algorithms aren't black magic, or carefully adjusted. They follow a rather simple logic, and this means that we can easily take only parts of these methods, and adjust them as necessary for our problem at hand!

The article persented at SDM will demonstrate such a use case: analyzing 1.2 million traffic accidents in the UK (from data.gov.uk) we are not interested in "classic" density based outliers - this would be a rare traffic accident on a small road somewhere in Scotland. Instead, we're interested in unusual concentrations of traffic accidents, i.e. blackspots.

The generalized pattern can be easily customized for this task. While this data does not allow automatic evaluation, many outliers could be easily verified using Google Earth and search: often, historic imagery on Google Earth showed that the road layout was changed, or that there are many news reports about the dangerous road. The data can also be nicely visualized, and I'd like to share these examples with you. First, here is a screenshot from Google Earth for one of the hotspots (Cherry Lane Roundabout, North of Heathrow airport, which used to be a double cut-through roundabout - one of the cut-throughs was removed since):

Screenshot of Cherry Lane Roundabout hotspot

Google Earth is best for exploring this result, because you can hide and show the density overlay to see the crossroad below; and you can go back in time to access historic imagery. Unfortunately, KML does not allow easy interactions (at least it didn't last time I checked).

I have also put the KML file on Google Drive. It will automatically display it on Google Maps (nice feature of Drive, kudos to Google!), but it should also allow you to download it. I've also explored the data on an Android tablet (but I don't think you can hide elements there, or access historic imagery as in the desktop application).

With a classic outlier detection method, this analysis would not have been possible. However, it was easy to customize the method; and the results are actually more meaningful: instead of relying on some heuristic to choose kernel bandwidth, I opted for choosing the bandwidth by physical arguments: 50 meters is a reasonable bandwidth for a crossroad / roundabout, and for comparison a radius of 2 kilometers is used to model the typical accident density in this region (there should other crossroads within 2 km in Europe).

Since I advocate reproducible science, the source code of the basic method will be in the next ELKI release. For the customization case studies, I plan to share them as a how-to or tutorial type of document in the ELKI wiki; probably also detailing data preprocessing and visualization aspects. The code for the customizations is not really suited for direct inclusion in the ELKI framework, but can serve as an example for advanced usage.

Reference:
E. Schubert, A. Zimek, H.-P. Kriegel
Generalized Outlier Detection with Flexible Kernel Density Estimates
In Proceedings of the 14th SIAM International Conference on Data Mining (SDM), Philadelphia, PA, 2014.

So TLDR of the story: A) try to use more established statistics (such as KDE), and B) don't expect an off-the-shelf solution to do magic, but customize the method for your problem.

P.S. if you happen to know nice post-doc positions in academia:
I'm actively looking for a position to continue my research. I'm working on scaling these methods to larger data and to make them work with various real data that I can find. Open-source, modular and efficient implementations are very important to me, and one of the directions I'd like to investigate is porting these methods to a distributed setting, for example using Spark. In order to get closer to "real" data, I've started to make these approaches work e.g. on textual data, mixed type data, multimedia etc. And of course, I like teaching; which is why I would prefer a position in academia.

13 February 2014

Erich Schubert: Google Summer of Code 2014 - not participating

Google Summer of Code 2014 is currently open for mentoring organizations to register. I've decided to not apply for GSoC with my data mining open source project ELKI anymore.

You don't get any feedback for your application. So unless you were accepted, you have no idea if it is worth trying again (I tried twice).
As far as I can tell, you only have a chance if there is someone at Google advocating your project. For example, someone involved in your project actually working at Google.
I don't really trust Google anymore. They've been too much exploiting their market dominance, leveraging YouTube and Android to force people into Google+ etc. (see Google turning evil blog post).
I'll be looking for a Post-Doc position in fall, so the timing of the GSoC isn't ideal anyway.

11 February 2014

Erich Schubert: Debian chooses systemd as default init

It has been all over the place. The Debian CTTE has chosen systemd over upstart as default init system by chairman call. This decision was overdue, as it was pretty clear that the tie will not change, and thus it will be up to chairman. There were no new positions presented, and nobody was being convinced of a different preference. The whole discussion had been escalating, and had started to harm Debian. Some people may not want to hear this, but another 10 ballots and options would not have changed this outcome. Repating essentially the same decision (systemd, upstart, or "other things nobody prefers") will do no good, but turn out the same result, a tie. Every vote counting I saw happen would turn out this tie. Half of the CTTE members prefer systemd, the other half upstart. The main problems, however, are these:

People are not realizing this is about the default init for jessie, not about the only init to support. Debian was and is about choice, but even then you need to make something default... If you prefer SysV init or openRC, you will still be able to use it (it will be just fewer people debugging these startup scripts).
Some people (not part of the CTTE) are just social inept trolls, and have rightfully been banned from the list.
The discussion has been so heated up, a number of imporant secondary decisions have not yet been discussed in a civil manner. For example, we will have some packages (for example, Gnome Logs), which are specific to a particular init system, but not part of the init system. A policy needs to be written with these cases in mind that states when and how a package may depend on a particular init system. IMHO, a package that cannot support other init systems without major changes should be allowed to depend on that system, but then meta-packages such as the Gnome meta packages must not require this application.

So please, everybody get back to work now. If there ever is enough reason to overturn this decision, there are formal ways to do this, and there are people in the CTTE (half of the CTTE, actually) that will take care of this. Until then, live with the result of a 0.1 votes lead for systemd. Instead of pursuing a destructive hate campaign, why not improve your favorite init system instead. Oh, and don't forget about the need to spell out a policy for init system support requirements of packages.

9 February 2014

Erich Schubert: Minimizing usage of debian-multimedia packages

How to reduce your usage of debian-multimedia packages: As you might be aware, the "deb-multimedia" packages have seen their share of chaos. Originally, they were named "debian-multimedia", but it was then decided that they should better be called "deb-multimedia" as they are not an official part of Debian. The old domain was then grabbed by cybersquatters when it expired. While a number of packages remain indistributable for Debian due to legal issues - such as decrypting DVDs - and thus "DMO" remains useful to many desktop users, please note that for many of the packages, a reasonable version exists within Debian "main". So here is a way to prevent automatically upgrading to DMO versions of packages that also exist in Debian main: We will use a configuration option known as apt pinning. We will modify the priority of DMO package to below 100, which means they will not be automatically upgraded to; but they can be installed when e.g. no version of this package exists within Debian main. I.e. the packages can be easily installed, but it will prefer using the official versions. For this we need to create a file I named /etc/apt/preferences.d/unprefer-dmo with the contents:

Package: *
Pin: release o=Unofficial Multimedia Packages
Pin-Priority: 123

As long as the DMO archive doesn't rename itself, this pin will work; it will continue to work if you use a mirror of the archive, as it is not tied to the URL. It will not downgrade packages for you. If you want to downgrade, this will be a trickier process. You can use aptitude for this, and start with a filter (l key, as in "limit") of ?narrow(~i,~Vdmo). This will only show packages installed where the version number contains dmo. You can now patrol this list, enter the detail view of each package, and check the version list at the end for a non-dmo version. I cannot claim this will be an easy process. You'll probably have a lot of conflicts to clear up, due to different libavcodec* API versions. If I recall correctly, I opted to uninstall some dmo packages such as ffmpeg that I do not really need. In other cases, the naming is slightly different: Debian main has handbrake, while dmo has handbrake-gtk. A simple approach is to consider uninstalling all of these packages. Then reinstall as needed; since installing Debian packages is trivial, it does not hurt to deinstall something, does it? When reinstalling, Debian packages will be preferred over DMO packages. I prefer to use DFSG-free software wherever possible. And as long as I can watch my favorite series (Tatort, started in 1970 and airing episode 900 this month) and the occasional DVD movie, I have all I need. An even more powerful aptitude filter to review installed non-Debian software is ?narrow(~i,!~ODebian), or equivalently ?narrow(?installed,?not(?origin(Debian))). This will list all package versions, which cannot be installed from Debian sources. In particular, this includes versions that are no longer available (they may have been removed because of security issues!), software that has been installed manually from .deb files, or any other non-Debian source such as Google or DMO. (You should check the output of aptitude policy that no source claims to be o=Debian that isn't Debian though). This filter is a good health check for your system. Debian packages receive a good amount of attention with respect to packaging quality, maintainability, security and all of these aspects. Third party packages usually failed the Debian quality check in at least one respect. For example, Sage isn't in Debian yet; mostly because it has so many dependencies, that making a reliable and upgradeable installation is a pain. Similarly, there is no Hadoop in Debian yet. If you look at Hadoop packaging efforts such as Apache Bigtop, they do not live up to Debian quality yet. In particular, the packages have high redundancy, and re-include various .jar copies instead of sharing access to an independently packaged dependency. If a security issue arises with any of these jars, all the Hadoop packages will need to be rebuilt. As you can see, it is usually not because of Debian being too strict or too conservative about free software when software is not packaged. More often than not, it's because the software in question itself currently does not live up to Debian packaging quality requirements. And of course sometimes because there is too little community backing the software in question, that would improve the packaging. If you want your software to be included in Debian, try to make it packaging friendly. Don't force-include copies of other software, for example. Allow the use of system-installed libraries where possible, and provide upgrade paths and backwards compatibility. Make dependencies optional, so that your software can be pacakged even if a dependency is not yet available - and does not have to be removed if the dependency becomes a security risk. Maybe learn how to package yourself, too. This will make it easier for you to push new features and bug fixes to your userbase: instead of requiring your users to manually upgrade, make your software packaging friendly. Then your users will be auto-upgraded by their operating system to the latest version. Update: If you are using APT::Default-Release or other pins, these may override above pin. You may need to use apt-cache policy to find out if the pins are in effect, and rewrite your default-release pin into e.g.

Package: *
Pin: release a=testing,o=Debian
Pin-Priority: 912

for debugging, use a different priority for each pin, so you can see which one is in effect. Notice the o=Debian in this pin, which makes it apply only to Debian repositories.

Next.

Previous.